Speaker orientation estimation based on hybridization of GCC-PHAT and HLBR
This paper presents a novel approach to speaker orientation estimation in a SmartRoom environment equipped with multiple microphones. In our previous work, the ratio between the high and low band energies (HLBR) received at each microphone was shown to be a promising cue for estimating the direction of the voice produced by a speaker. In this work, for each microphone pair, a smoothed cross-power spectrum (CPS) phase is obtained by properly windowing the main peak of the cross-correlation sequence estimated with the GCC-PHAT method, and an HLBR is computed from the processed CPS. The proposed method keeps the computational simplicity of the HLBR algorithm while adding the robustness offered by the GCC-PHAT technique. Preliminary experiments were conducted on a database recorded purposely in the UPC Smart room and on the CLEAR head pose database. The proposed method performs consistently better than other state-of-the-art techniques on both databases.
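The two building blocks named above, GCC-PHAT and the high-to-low band energy ratio, are both standard enough to sketch. The following is a minimal numpy illustration, not the authors' implementation: the 4 kHz band split, the signal lengths, and the function names are illustrative assumptions, and the peak-windowing/CPS-smoothing step of the paper is omitted.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """PHAT-weighted cross-correlation of two microphone signals.
    After fftshift, a peak right of center means x2 is delayed
    relative to x1 (positive TDOA)."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cps = np.conj(X1) * X2              # cross-power spectrum
    cps /= np.abs(cps) + 1e-12          # PHAT weighting: keep phase only
    return np.fft.fftshift(np.fft.irfft(cps, n))

def hlbr(signal, fs, split_hz=4000.0):
    """High-to-low band energy ratio (band split is an assumption)."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    high = spec[freqs >= split_hz].sum()
    low = spec[freqs < split_hz].sum() + 1e-12
    return high / low
```

The HLBR cue works because a talker's head radiates high frequencies more directionally than low ones, so microphones behind the speaker receive relatively less high-band energy.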
Efficient keyword spotting by capturing long-range interactions with temporal lambda networks
Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model in which the residual blocks are replaced by temporal Lambda layers. Furthermore, the proposed architecture is built upon one-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracies on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (ResNet15) counterparts, while being up to 100× faster. To the best of our knowledge, this is the first attempt to explore the Lambda framework within the speech domain, and we hope it encourages further research on architectures built on this framework.
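The efficiency argument behind Lambda layers can be hinted at with their content path: the context is summarized into one small linear map that every query position reuses, avoiding the quadratic query-by-context attention matrix. This is a minimal numpy sketch of the content lambda only; the position lambdas, multi-head batching, and the temporal-convolution backbone of the proposed model are all omitted, and the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda_layer(Q, K, V):
    """Content lambda: Q (n, k) queries, K (m, k) keys, V (m, v) values.
    Keys are normalized over the m context positions, then folded with
    the values into a single (k, v) linear function applied to each query."""
    K_bar = softmax(K, axis=0)   # normalize each key dimension over context
    lam_c = K_bar.T @ V          # (k, v) content lambda
    return Q @ lam_c             # (n, v) outputs, cost linear in n and m
```

The cost is O(nkv + mkv) rather than attention's O(nm) pairwise interaction map, which is what makes the layer attractive for small-footprint keyword spotting.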
Two-source acoustic event detection and localization: online implementation in a smart-room
Real-time processing is a requirement for many practical signal processing applications. In this work we implemented online two-source acoustic event detection and localization algorithms in a Smart-room, a closed space equipped with multiple microphones. Acoustic event detection is based on HMMs, which allow the input audio signal to be processed with very low latency; acoustic source localization is based on the SRP-PHAT localization method, which is known to perform robustly in most scenarios. The experimental results from online tests show high recognition accuracy for most acoustic events, both isolated and overlapped with speech.
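The SRP-PHAT idea is to score every candidate source position by summing, over all microphone pairs, the PHAT-weighted cross-correlation value at the time difference of arrival (TDOA) that position would geometrically produce. The following is a minimal numpy sketch under simplifying assumptions (integer-sample TDOAs, a coarse search grid, no interpolation), not a real-time implementation:

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def gcc_phat(x1, x2):
    """PHAT-weighted cross-correlation; index l holds circular lag l,
    negative lags wrapping to the end of the array."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cps = np.conj(X1) * X2
    cps /= np.abs(cps) + 1e-12
    return np.fft.irfft(cps, n)

def srp_phat(signals, mic_pos, grid, fs):
    """Return the grid point with the highest steered response power."""
    n_mics = len(signals)
    pairs = [(i, j) for i in range(n_mics) for j in range(i + 1, n_mics)]
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]]) for p in pairs}
    scores = np.zeros(len(grid))
    for g, pt in enumerate(grid):
        for (i, j), cc in ccs.items():
            # TDOA the candidate point would produce for this mic pair
            tdoa = (np.linalg.norm(pt - mic_pos[j]) -
                    np.linalg.norm(pt - mic_pos[i])) / C
            scores[g] += cc[int(round(tdoa * fs)) % len(cc)]
    return grid[np.argmax(scores)]
```

Because the PHAT weighting whitens the spectrum, the correlation peaks are sharp even in reverberation, which is the robustness property the abstract refers to.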
Audiovisual head orientation estimation with particle filtering in multisensor scenarios
This article presents a multimodal approach to head pose estimation of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing. Determining an individual's head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices, and can also be used for automatic sensor selection (camera, microphone) in communications or video surveillance systems. The use of particle filters as a unified framework for head orientation estimation in both monomodal and multimodal cases is proposed. In video, we estimate head orientation from color information by exploiting spatial redundancy among cameras. Audio information is processed to estimate the direction of the voice produced by a speaker, making use of the directivity characteristics of the head radiation pattern. Furthermore, two different particle-filter multimodal information fusion schemes for combining the audio and video streams are analyzed in terms of accuracy and robustness. In the first, fusion is performed at the decision level by combining the monomodal head pose estimates, while the second uses a joint estimation system combining information at the data level. Experimental results conducted over the CLEAR 2006 evaluation database are reported, and the comparison of the proposed multimodal head pose estimation algorithms with the reference monomodal approaches proves the effectiveness of the proposed approach.
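A sampling-importance-resampling (SIR) particle filter of the kind used as the unifying framework above can be sketched in a few lines. This is a generic, hedged illustration tracking a single pan angle from noisy per-frame orientation measurements; the noise parameters, particle count, and observation model are all assumptions, not the paper's. In the data-level fusion scheme, the single likelihood below would be replaced by the product of an audio likelihood and a video likelihood evaluated on the same particles.

```python
import numpy as np

def particle_filter_orientation(observations, n_particles=500,
                                motion_std=10.0, obs_std=20.0, seed=0):
    """SIR particle filter for a head-pan angle in degrees."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0, 360, n_particles)
    estimates = []
    for z in observations:
        # predict: random-walk motion model on the circle
        particles = (particles + rng.normal(0, motion_std, n_particles)) % 360
        # weight: Gaussian likelihood of the observation, circular error
        err = (particles - z + 180) % 360 - 180
        w = np.exp(-0.5 * (err / obs_std) ** 2)
        w /= w.sum()
        # estimate: weighted circular mean of the particle set
        ang = np.deg2rad(particles)
        est = np.rad2deg(np.arctan2((w * np.sin(ang)).sum(),
                                    (w * np.cos(ang)).sum())) % 360
        estimates.append(est)
        # resample proportionally to the weights
        particles = rng.choice(particles, n_particles, p=w)
    return np.array(estimates)
```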
Acoustic event detection based on feature-level fusion of audio and video modalities
Acoustic event detection (AED) aims at determining the identity of sounds and their temporal position in audio signals. When applied to spontaneously generated acoustic events, AED based only on audio information shows a large number of errors, mostly due to temporal overlaps. In fact, temporal overlaps accounted for more than 70% of the errors in the real-world interactive seminar recordings used in the CLEAR 2007 evaluations. In this paper, we improve the recognition rate of acoustic events using information from both audio and video modalities. First, the acoustic data are processed to obtain both a set of spectro-temporal features and the 3D localization coordinates of the sound source. Second, a number of features are extracted from video recordings by means of object detection, motion analysis, and multi-camera person tracking to represent the visual counterpart of several acoustic events. A feature-level fusion strategy is used, and a parallel structure of binary HMM-based detectors is employed in our work. The experimental results show that information from both the microphone array and the video cameras is useful to improve the detection rate of isolated as well as spontaneously generated acoustic events.
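Feature-level (early) fusion, as opposed to fusing the decisions of separate classifiers, simply means building one joint observation vector per frame before any modeling. A minimal sketch of that step, under the assumption of frame-synchronous streams, is below; per-stream z-normalization is an illustrative choice to keep one modality's scale from dominating, and the downstream parallel binary HMM detectors of the paper are not shown.

```python
import numpy as np

def fuse_features(audio_feats, video_feats):
    """Early fusion of frame-synchronous feature streams.
    audio_feats: (T, Da) array, video_feats: (T, Dv) array.
    Each dimension is z-normalized per stream, then the
    streams are concatenated into a (T, Da + Dv) observation."""
    def znorm(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-9)
    return np.hstack([znorm(audio_feats), znorm(video_feats)])
```

With a parallel structure of binary detectors, one such fused stream feeds a separate "event vs. everything else" model per acoustic event class, which is what lets overlapped events fire more than one detector at a time.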
Miradas y voces de la Investigación Educativa IV: Institutions and educational systems. Educational innovation aimed at social justice. Contributions from Educational Research
Affiliations: Ferreyra, Horacio Ademar (Universidad Católica de Córdoba, Facultad de Educación; Argentina); Sañudo Guerra, Lya (Universidad Católica de Córdoba, Facultad de Educación; Argentina); Universidad Católica de Córdoba (Argentina); Red de Posgrados en Educación AC (México).
Clonal chromosomal mosaicism and loss of chromosome Y in elderly men increase vulnerability for SARS-CoV-2
The pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, COVID-19) had an estimated overall case fatality ratio of 1.38% (pre-vaccination), 53% higher in males and increasing exponentially with age. Among 9578 individuals diagnosed with COVID-19 in the SCOURGE study, we found 133 cases (1.42%) with detectable clonal mosaicism for chromosome alterations (mCA) and 226 males (5.08%) with acquired loss of chromosome Y (LOY). Individuals with clonal mosaic events (mCA and/or LOY) showed a 54% increase in the risk of COVID-19 lethality. LOY is associated with transcriptomic biomarkers of immune dysfunction, pro-coagulation activity, and cardiovascular risk. Interferon-induced genes involved in the initial immune response to SARS-CoV-2 are also down-regulated in LOY. Thus, mCA and LOY underlie at least part of the sex-biased severity and mortality of COVID-19 in aging patients. Given its potential therapeutic and prognostic relevance, evaluation of clonal mosaicism should be implemented as a biomarker of COVID-19 severity in elderly people.
Overlap detection for speaker diarization by fusing spectral and spatial features
A substantial portion of the errors of conventional speaker diarization systems on meeting data can be attributed to overlapped speech. This paper proposes the use of several spatial features to improve speech overlap detection on distant-channel microphones. These spatial features are integrated into a spectral-based system by using principal component analysis and neural networks. Different overlap detection hypotheses are used to improve diarization performance with both overlap exclusion and overlap labeling. In experiments conducted on the AMI Meeting Corpus, we demonstrate a relative DER improvement of 11.6% and 14.6% for single- and multi-site data, respectively.
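The principal component analysis step used to integrate the spatial features can be sketched with plain numpy. This is a generic PCA projection via the SVD, a hedged illustration rather than the paper's pipeline; the component count and the final concatenation with the spectral stream are assumptions for the example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project feature vectors onto their top principal components.
    X: (T, D) array of per-frame features. The SVD returns singular
    values in descending order, so the leading columns of the
    projection carry the most variance."""
    Xc = X - X.mean(axis=0)                       # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # (T, n_components)
```

In a fusion setup like the one described, the compacted spatial features would then be stacked next to the spectral features (e.g. `np.hstack([spectral, pca_project(spatial, k)])`) before the neural-network overlap detector.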